OCR exploration server

Warning: this site is under development!
Warning: this site was generated by automated means from raw corpora.
The information has therefore not been validated.

OCR correction and query expansion for retrieval on OCR data: CLARIT TREC-5 confusion track report

Internal identifier: 002588 (Main/Exploration); previous: 002587; next: 002589


Authors: XIANG TONG [United States]; CHENGXIANG ZHAI [United States]; N. Milic-Frayling [United States]; D. A. Evans [United States]

Source: NIST special publication (ISSN 1048-776X), 1997

RBID : Pascal:98-0270910

French descriptors

English descriptors

Abstract

In the CLARIT TREC-5 confusion track experiments, the authors explored two techniques for improving retrieval performance over corrupted data: (1) OCR word error correction to improve the accuracy of the OCR text, and (2) query expansion by adding query term variants found in the corrupted text. The OCR word correction technique is based on statistical word bigram modeling (Tong & Evans 1996). The variants of a query term are terms similar to it, as measured by edit distance (Wagner 1974). While the official runs were based on the first approach, the follow-up experiments tested the second approach as well. The report gives a brief description of the OCR correction and query expansion techniques and then discusses the results of the experiments.
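
The second technique in the abstract, expanding a query term with similar terms found in the corrupted text, can be illustrated with a minimal sketch. It assumes the standard dynamic-programming formulation of edit distance with unit costs for insertion, deletion and substitution. The function names, the distance threshold `max_distance` and the sample vocabulary are hypothetical and chosen only for illustration; the abstract does not state the costs, threshold or selection procedure actually used in the experiments.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance between a and b with unit costs (an assumption)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def expand_query_term(term, vocabulary, max_distance=1):
    """Return vocabulary terms within max_distance edits of term (hypothetical helper)."""
    return sorted(
        word for word in vocabulary
        if word != term and edit_distance(term, word) <= max_distance
    )


# Hypothetical vocabulary, as might be extracted from OCR-degraded text.
ocr_vocabulary = {
    "retrieval", "retrieva1", "retricval", "rctrieval",
    "query", "qucry", "document",
}

variants = expand_query_term("retrieval", ocr_vocabulary)
print(variants)
```

In this sketch the expanded query would contain the original term plus the returned variants. A larger `max_distance` admits more variants at the cost of more unrelated terms, which is a trade-off the threshold has to balance.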


Affiliations:




The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report</title>
<author>
<name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">98-0270910</idno>
<date when="1997">1997</date>
<idno type="stanalyst">PASCAL 98-0270910 INIST</idno>
<idno type="RBID">Pascal:98-0270910</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000885</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B12</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000893</idno>
<idno type="wicri:doubleKey">1048-776X:1997:Xiang Tong:ocr:correction:and</idno>
<idno type="wicri:Area/Main/Merge">002724</idno>
<idno type="wicri:Area/Main/Curation">002588</idno>
<idno type="wicri:Area/Main/Exploration">002588</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report</title>
<author>
<name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Laboratory for Computational Linguistics, Carnegie Mellon University</s1>
<s2>Pittsburgh, PA 15213</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
<settlement type="city">Pittsburgh</settlement>
</placeName>
<orgName type="university">Université Carnegie-Mellon</orgName>
</affiliation>
</author>
<author>
<name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<affiliation wicri:level="2">
<inist:fA14 i1="02">
<s1>CLARITECH Corporation, 5301 Fifth Ave.</s1>
<s2>Pittsburgh, PA 15232-2124</s2>
<s3>USA</s3>
<sZ>3 aut.</sZ>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Pennsylvanie</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">NIST special publication</title>
<title level="j" type="abbreviated">NIST spec. publ.</title>
<idno type="ISSN">1048-776X</idno>
<imprint>
<date when="1997">1997</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">NIST special publication</title>
<title level="j" type="abbreviated">NIST spec. publ.</title>
<idno type="ISSN">1048-776X</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Automated processing</term>
<term>Automatic correction</term>
<term>Data</term>
<term>Degradation</term>
<term>Information retrieval</term>
<term>Optical character recognition</term>
<term>Query</term>
<term>Question processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Recherche information</term>
<term>Reconnaissance optique caractère</term>
<term>Correction automatique</term>
<term>Question documentaire</term>
<term>Traitement automatisé</term>
<term>Dégradation</term>
<term>Donnée</term>
<term>CLARIT</term>
<term>Elargissement question</term>
<term>Traitement question</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">In CLARIT TREC-5 confusion track experiments, they explored two techniques for improving retrieval performance over corrupted data : (1) OCR word error correction to improve OCR text accuracy, and (2) query expansion by adding query term variants found in the corrupted text. The OCR word correction technique is based on statistical word bigram modeling (Tong &amp; Evans 1996). The variants of a query term are terms similar to the query term, as measured by the edit distance (Wagner 1974). While the official runs were based on the first approach, in the follow-up experiments they tested the second approach as well. In this report, they give a brief description of the OCR correction and query expansion techniques, and then discuss the results of the experiments</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Pennsylvanie</li>
</region>
<settlement>
<li>Pittsburgh</li>
</settlement>
<orgName>
<li>Université Carnegie-Mellon</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Pennsylvanie">
<name sortKey="Xiang Tong" sort="Xiang Tong" uniqKey="Xiang Tong" last="Xiang Tong">XIANG TONG</name>
</region>
<name sortKey="Chengxiang Zhai" sort="Chengxiang Zhai" uniqKey="Chengxiang Zhai" last="Chengxiang Zhai">CHENGXIANG ZHAI</name>
<name sortKey="Evans, D A" sort="Evans, D A" uniqKey="Evans D" first="D. A." last="Evans">D. A. Evans</name>
<name sortKey="Milic Frayling, N" sort="Milic Frayling, N" uniqKey="Milic Frayling N" first="N." last="Milic-Frayling">N. Milic-Frayling</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002588 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002588 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:98-0270910
   |texte=   OCR correction and query expansion for retrieval on OCR data : CLARIT TREC-5 confusion track report
}}


This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024